MT Evaluation: Human-Like vs. Human Acceptable
Authors
Abstract
We present a comparative study on Machine Translation Evaluation according to two different criteria: Human Likeness and Human Acceptability. We provide empirical evidence that there is a relationship between these two kinds of evaluation: Human Likeness implies Human Acceptability, but the reverse is not true. From the point of view of automatic evaluation, this implies that metrics based on Human Likeness are more reliable for system tuning. Our results also show that current evaluation metrics are not always able to distinguish between automatic and human translations. In order to improve the descriptive power of current metrics, we propose the use of additional syntax-based metrics and metric combinations inside the QARLA Framework.
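To make the discrimination claim concrete, here is a minimal sketch (not the paper's QARLA implementation): it assumes a hypothetical sentence-level similarity function `sentence_metric` and checks whether that metric scores held-out human translations higher against the remaining references than it scores an automatic output. Python is used purely for illustration.

```python
# Minimal sketch of a Human-Likeness-style discrimination check, not the
# paper's method: a metric is descriptive only if it rates held-out human
# translations (against the remaining references) above automatic output.
# `sentence_metric` is a hypothetical placeholder for any sentence-level
# similarity metric (lexical or syntax-based).

from statistics import mean

def sentence_metric(candidate: str, references: list[str]) -> float:
    """Hypothetical sentence-level metric: here, plain unigram precision."""
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    ref_tokens = set(tok for ref in references for tok in ref.split())
    return sum(tok in ref_tokens for tok in cand_tokens) / len(cand_tokens)

def discrimination_gap(human_refs: list[str], system_output: str) -> float:
    """Score each human reference against the other references (leave-one-out)
    and the system output against all references; a positive gap means the
    metric still tells human translations apart from the automatic one."""
    human_scores = [
        sentence_metric(ref, human_refs[:i] + human_refs[i + 1:])
        for i, ref in enumerate(human_refs)
    ]
    system_score = sentence_metric(system_output, human_refs)
    return mean(human_scores) - system_score

if __name__ == "__main__":
    refs = ["the cat sat on the mat", "a cat was sitting on the mat"]
    hyp = "the cat mat sat the on"  # scrambled automatic output
    print(f"gap = {discrimination_gap(refs, hyp):+.3f}")
```

With this toy unigram-precision metric the gap comes out negative: the scrambled automatic output is scored above the human translations, which is exactly the failure mode that motivates adding syntax-based metrics.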
Similar papers
Semantic vs. Syntactic vs. N-gram Structure for Machine Translation Evaluation
We present results of an empirical study on evaluating the utility of machine translation output by assessing the accuracy with which human readers are able to complete semantic role annotation templates. Unlike the widely used lexical, n-gram-based, or syntax-based MT evaluation metrics, which are fluency-oriented, our results show that using semantic role labels to evaluate the ut...
Experimental Comparison of MT Evaluation Methods: RED vs. BLEU
This paper experimentally compares two automatic evaluators, RED and BLEU, to determine how close the evaluation results of each automatic evaluator are to average evaluation results by human evaluators, following the ATR standard of MT evaluation. This paper gives several cautionary remarks intended to prevent MT developers from drawing misleading conclusions when using the automatic evaluator...
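As a hedged illustration of the kind of comparison described here, the sketch below measures how closely an automatic evaluator's scores track averaged human evaluation results. The score lists are invented placeholders (not the ATR data), and SciPy's pearsonr/spearmanr stand in for whatever agreement measure is actually used.

```python
# Minimal sketch, not from the paper: checking how closely an automatic
# evaluator (e.g. RED or BLEU) follows averaged human evaluation results.
# The score lists below are made-up illustrations, not real evaluation data.

from scipy.stats import pearsonr, spearmanr

# One entry per evaluated translation.
human_avg = [4.2, 3.1, 2.5, 4.8, 1.9]          # averaged human scores
metric_scores = [0.61, 0.42, 0.35, 0.70, 0.22]  # automatic evaluator scores

r, _ = pearsonr(metric_scores, human_avg)      # linear agreement
rho, _ = spearmanr(metric_scores, human_avg)   # rank agreement
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```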
The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are central to Machine Translation (MT) engines, since engine development relies on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from the Lexical Similarity set on machine tra...
An Intelligent, Semantics-Oriented System for Evaluating Text Summarization Systems
Nowadays, summarizers and machine translators have attracted much attention, and considerable work on building such tools has been done around the world. For Farsi, as for other languages, there have been efforts in this field, so evaluating such tools is of great importance. Human evaluations of machine summarization are thorough but expensive; they can take months to f...
Unsupervised vs. supervised weight estimation for semantic MT evaluation metrics
We present an unsupervised approach to estimate the appropriate degree of contribution of each semantic role type for semantic translation evaluation, yielding a semantic MT evaluation metric whose correlation with human adequacy judgments is comparable to that of recent supervised approaches but without the high cost of a human-ranked training corpus. Our new unsupervised estimation approach i...
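A minimal sketch, assuming per-role match scores between MT output and reference are already available, of how a semantic MT evaluation metric can aggregate them with per-role-type contribution weights. The role inventory, scores, and weights below are hypothetical placeholders; the paper's unsupervised estimation of those weights is not reproduced here.

```python
# Minimal sketch: a semantic MT metric of the kind described here combines,
# for each semantic role type, how well the roles in the MT output match
# those in the reference, weighted by an estimated contribution per role type.
# All values below are hypothetical placeholders, not the paper's estimates.

role_match = {          # per-role-type match score (MT output vs. reference)
    "ARG0": 0.80,       # agent
    "ARG1": 0.65,       # patient
    "ARGM-TMP": 0.40,   # temporal modifier
    "ARGM-LOC": 0.55,   # locative modifier
}

weights = {             # estimated contribution of each role type (sums to 1.0)
    "ARG0": 0.35,
    "ARG1": 0.35,
    "ARGM-TMP": 0.15,
    "ARGM-LOC": 0.15,
}

# Weighted aggregate: the quantity that supervised approaches tune on
# human-ranked data and that the unsupervised approach estimates without it.
score = sum(weights[r] * role_match[r] for r in role_match)
print(f"semantic metric score = {score:.3f}")
```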